Incremental recomputations in materialized data integration

Author

  • Thomas Jörg
Abstract

Data integration aims to provide uniform access to heterogeneous data managed by distributed source systems. Data sources range from legacy systems, databases, and enterprise applications to web-scale data management systems. The materialized approach to data integration extracts data from the sources, transforms and consolidates it, and loads it into an integration system, where it is persistently stored and can be queried and analyzed. To support materialized data integration, so-called Extract-Transform-Load (ETL) systems have been built and are widely used to populate data warehouses today. While ETL is considered state of the art in enterprise data warehousing, a new paradigm known as MapReduce has recently gained popularity for web-scale data transformations, such as web indexing or PageRank computation. The input data of both ETL and MapReduce programs keeps changing over time, for instance as business transactions are processed or the web is crawled. Hence, the results of ETL and MapReduce programs become stale and need to be recomputed from time to time. Recurrent computations over changing input data can be performed in two ways: the result may either be recomputed from scratch or recomputed incrementally. The idea behind the latter approach is to update the existing result in response to incremental changes in the input data. This is typically more efficient than full recomputation, because reprocessing unchanged portions of the input data can often be avoided. Incremental recomputation techniques have been studied by the database research community mainly in the context of materialized view maintenance and have been adopted by all major commercial database systems today. However, neither today's ETL tools nor MapReduce supports incremental recomputation techniques. The situation of ETL and MapReduce programmers today is thus comparable to that of database programmers in the early 1990s.
This thesis makes an effort to transfer incremental recomputation techniques into the ETL and MapReduce environments. This poses interesting research challenges, because these environments differ fundamentally from the relational world with regard to query and programming models, change data capture, transactional guarantees and consistency models. However, as this thesis will show, incremental recomputations are feasible in ETL and MapReduce and may lead to considerable efficiency improvements.
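The contrast the abstract draws between full and incremental recomputation can be illustrated with a minimal sketch. It is not the method developed in the thesis; it only assumes a simple per-key sum over (key, amount) rows, and the function names `full_recompute` and `apply_delta` as well as the sample data are hypothetical. A row count is kept next to each sum so that a key disappears once its last row is deleted.

```python
def full_recompute(rows):
    """Recompute per-key (sum, row count) from scratch over the entire input."""
    state = {}
    for key, amount in rows:
        total, count = state.get(key, (0, 0))
        state[key] = (total + amount, count + 1)
    return state


def apply_delta(state, inserted, deleted):
    """Update an existing result using only the changed rows.

    Assumption: every deleted row was part of the input that produced `state`.
    """
    updated = dict(state)  # leave the previous result untouched
    for key, amount in inserted:
        total, count = updated.get(key, (0, 0))
        updated[key] = (total + amount, count + 1)
    for key, amount in deleted:
        total, count = updated[key]
        if count == 1:
            del updated[key]  # last row for this key is gone
        else:
            updated[key] = (total - amount, count - 1)
    return updated


# Hypothetical sample data: the old input, the captured changes, the new input.
base = [("a", 10), ("b", 5), ("a", 7)]
inserted = [("c", 3), ("b", 2)]
deleted = [("a", 10)]
new_input = [("b", 5), ("a", 7), ("c", 3), ("b", 2)]

incremental = apply_delta(full_recompute(base), inserted, deleted)
assert incremental == full_recompute(new_input)
```

The incremental path touches only the three changed rows, while the full path rescans all of the new input; both arrive at the same result. Sums are easy to maintain this way, whereas aggregates such as minimum or maximum may need access to the remaining rows after a deletion.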

Similar resources

A Framework for Supporting Data

This paper presents a framework for data integration currently under development in the Squirrel project. The framework is based on a special class of mediators, called Squirrel integration mediators. These mediators can support the traditional virtual and materialized approaches, and also hybrids of them. This permits considerable flexibility when adapting to diverse data integration environment...

A Framework for Supporting Data Integration Using

This paper presents a framework for data integration currently under development in the Squirrel project. The framework is based on a special class of mediators, called Squirrel integration mediators. These mediators can support the traditional virtual and materialized approaches, and also hybrids of them. In the Squirrel mediators, a relation in the integrated view can be supported as (a) full...

Deferred Incremental Refresh of XML Materialized Views

The view mechanism can provide the user with an appropriate portion of database through data filtering and integration. In the Web era where information proliferates, the view concept is also useful for XML documents in a Web-based information system. Views are often materialized for query performance, and in that case, their consistency needs to be maintained against the updates of the underly...

MOVIE: An incremental maintenance system for materialized object views

View materialization is an important technique for high performance query processing, data integration and replication. Solutions to the problem of incrementally maintaining materialized views are very relevant. So far, most work on this problem has been confined to relational settings and solutions have not been comprehensively evaluated. This paper describes MOVIE, a complete, implemented and...

Incremental Data Mining Using Concurrent Online Refresh of Materialized Data Mining Views

Data mining is an iterative process. Users issue series of similar data mining queries, in each consecutive run slightly modifying either the definition of the mined dataset, or the parameters of the mining algorithm. This model of processing is most suitable for incremental mining algorithms that reuse the results of previous queries when answering a given query. Incremental mining algorithms ...

Publication date: 2013